feat(AGX1-274): record task creator identity and FGAC migration safety#246

Open
asherfink wants to merge 2 commits into main from asher.fink/agx1-274-task-dual-write

@asherfink asherfink commented May 21, 2026

Related work

Parent epic: AGX1-264 — per-task FGAC. Follow-ups bundled in AGX1-291.

This change is part of a 5-PR stack across 3 repos.

| Repo | PR | Purpose |
| --- | --- | --- |
| scaleapi/scaleapi | scaleapi/scaleapi#144783 ✅ merged | sgp-authz 0.7.1: Action.CANCEL |
| scaleapi/scaleapi | scaleapi/scaleapi#145000 | register FGAC_AGENTEX_AUTH_SPARK flag |
| scaleapi/scaleapi | scaleapi/scaleapi#145044 | add cancel to SGP's AgentexOperation enum + role map |
| scaleapi/agentex | scaleapi/agentex#353 | agentex-auth per-account routing + cancel op |
| scaleapi/scale-agentex | this PR | task creator audit columns + FGAC dual-write + flag |
| scaleapi/scale-agentex | #249 | per-RPC operation rewire + 404/403 wrap |

Summary

Commit 1 — passive audit columns

  • Adds creator_user_id and creator_service_account_id columns to the tasks table, populated from the request principal in AgentTaskService.create_task. Best-effort (NULLable; see caveat below).
  • Adds a CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL)) constraint (ck_tasks_at_most_one_creator) to enforce at-most-one creator type at the DB layer.
  • Adds partial indexes ix_tasks_creator_user_id and ix_tasks_creator_service_account_id (CREATE INDEX CONCURRENTLY ... WHERE column IS NOT NULL) for future "tasks created by X" lookups.
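
The at-most-one-creator rule above can be tried out in isolation. The following is a minimal sketch, assuming an in-memory SQLite table as a stand-in for the Postgres tasks table; only the constraint name and predicate come from this PR, while the column types and the insert_task helper are illustrative.

```python
import sqlite3

# SQLite stands in for Postgres; the CHECK predicate is the one quoted in the PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tasks (
        id TEXT PRIMARY KEY,
        creator_user_id TEXT,              -- column type is an assumption
        creator_service_account_id TEXT,   -- column type is an assumption
        CONSTRAINT ck_tasks_at_most_one_creator CHECK (
            (creator_user_id IS NULL) OR (creator_service_account_id IS NULL)
        )
    )
    """
)


def insert_task(task_id, user_id=None, service_account_id=None):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tasks VALUES (?, ?, ?)",
                (task_id, user_id, service_account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "both_null": insert_task("t1"),
    "user_only": insert_task("t2", user_id="u-1"),
    "service_account_only": insert_task("t3", service_account_id="sa-1"),
    "both_set": insert_task("t4", user_id="u-1", service_account_id="sa-1"),
}
```

Only the both-set row should be rejected; both-NULL stays legal, which is what keeps the columns best-effort.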

Commit 2 — FGAC dual-write call sites + flag

  • Adds an FGAC_TASKS_DUAL_WRITE env-var flag, injected into AgentTaskService via FastAPI DI. Gates the dual-write behavior end-to-end.
  • create_task calls register_resource(task, parent=agent) on the authorization service after the Postgres row persists, so the task is registered with tenant + owner + parent_agent tuples atomically (via agentex-auth's /v1/authz/register, landed on agentex-auth main via #354).
  • delete_task calls deregister_resource(task) after the Postgres delete. Pre-resolves the task id by name first so the post-delete deregister doesn't race the lookup.
  • Both call sites share a _dual_write_with_retry(op_name, do_call, task_id) helper. Retries AuthenticationServiceUnavailableError / AuthenticationGatewayError with exponential backoff + jitter (3 retries → 4 total attempts max), mirroring AgentsACPUseCase.grant_with_retry. Non-transient exceptions are not retried.
  • Emits Datadog metrics (task_fgac_dual_write.attempt|success|retry|failure) tagged with op:register|deregister and exception_class:<name> on failure — the rollout signal for AGX1-291's operator runbook.
  • The parent= kwarg matches agentex-auth's wire schema (set by #354); the proxy adapter serializes it to the JSON field parent.
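
The retry behavior described in the bullets above might look roughly like the sketch below. It is not the PR's code: the exception classes are local stand-ins for the ones named in the description, the backoff and jitter values are assumed, and the emit and sleep parameters are hypothetical hooks added so the example runs without Datadog or real delays.

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)


# Local stand-ins for the transient exception types named in the PR.
class AuthenticationServiceUnavailableError(Exception):
    pass


class AuthenticationGatewayError(Exception):
    pass


_TRANSIENT = (AuthenticationServiceUnavailableError, AuthenticationGatewayError)
_MAX_RETRIES = 3               # 3 retries, so 4 attempts at most (per the PR)
_BASE_BACKOFF_SECONDS = 0.05   # assumed value
_JITTER_SECONDS = 0.1          # assumed value


async def dual_write_with_retry(op_name, do_call, task_id, emit, sleep=asyncio.sleep):
    """Run do_call, retrying transient auth errors with exponential backoff + jitter."""
    for attempt in range(_MAX_RETRIES + 1):
        emit("attempt", op_name)
        try:
            result = await do_call()
        except _TRANSIENT as exc:
            if attempt == _MAX_RETRIES:
                emit("failure", op_name, type(exc).__name__)
                raise
            emit("retry", op_name)
            logger.warning("%s retry %d for task %s", op_name, attempt + 1, task_id)
            delay = _BASE_BACKOFF_SECONDS * 2**attempt + random.uniform(0, _JITTER_SECONDS)
            await sleep(delay)
        except Exception as exc:
            # Non-transient errors are reported and re-raised without retrying.
            emit("failure", op_name, type(exc).__name__)
            raise
        else:
            emit("success", op_name)
            return result


# Usage: a register call that fails twice with a transient error, then succeeds.
events = []
calls = {"n": 0}


def emit(metric, op, exception_class=None):
    events.append((metric, op, exception_class))


async def flaky_register():
    calls["n"] += 1
    if calls["n"] < 3:
        raise AuthenticationServiceUnavailableError("auth unavailable")
    return "registered"


async def no_sleep(_seconds):
    return None


outcome = asyncio.run(
    dual_write_with_retry("register", flaky_register, "task-1", emit, sleep=no_sleep)
)
```

Injecting sleep keeps the backoff arithmetic in one place while letting tests skip the waits.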

Migration safety

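  • ALTER TABLE ... ADD CONSTRAINT ... NOT VALID + ALTER TABLE ... VALIDATE CONSTRAINT: splits the operation so the ACCESS EXCLUSIVE lock is held only briefly and never across the full-table validation scan; VALIDATE CONSTRAINT then scans existing rows under a weaker SHARE UPDATE EXCLUSIVE lock that does not block reads or writes. tasks is high-write; a CHECK addition without NOT VALID would hold ACCESS EXCLUSIVE for the whole scan, blocking readers and writers until it finished.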
  • Indexes are CREATE INDEX CONCURRENTLY in an autocommit_block.
  • Migration revision: a1f73ada66c5 (add_task_creator_columns). down_revision is 6c942325c828 (adding_task_cleaned_at, the current alembic head on main); migration_history.txt regenerated via alembic history. The ORM-side CheckConstraint in orm.py matches the DB-side (same constraint name + predicate).
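
As a rough illustration of the statement order described above, here is a sketch that collects the SQL into a list instead of running it. In an Alembic migration the execute callable would be op.execute, with the statements that cannot run inside a transaction (CREATE INDEX CONCURRENTLY) wrapped in `with op.get_context().autocommit_block():`. The VARCHAR column type and the exact ordering are assumptions; the constraint and index names come from the PR description.

```python
from typing import Callable, List

CHECK_NAME = "ck_tasks_at_most_one_creator"
CHECK_PREDICATE = "(creator_user_id IS NULL) OR (creator_service_account_id IS NULL)"


def upgrade(execute: Callable[[str], None]) -> None:
    """Emit the migration's SQL in order through the given execute callable."""
    # Nullable columns with no default (type VARCHAR is an assumption).
    execute("ALTER TABLE tasks ADD COLUMN creator_user_id VARCHAR")
    execute("ALTER TABLE tasks ADD COLUMN creator_service_account_id VARCHAR")

    # Step 1: add the constraint without scanning existing rows.
    execute(
        f"ALTER TABLE tasks ADD CONSTRAINT {CHECK_NAME} "
        f"CHECK ({CHECK_PREDICATE}) NOT VALID"
    )
    # Step 2: validate existing rows in a separate statement.
    execute(f"ALTER TABLE tasks VALIDATE CONSTRAINT {CHECK_NAME}")

    # Partial indexes, built without blocking writes.
    for column in ("creator_user_id", "creator_service_account_id"):
        execute(
            f"CREATE INDEX CONCURRENTLY ix_tasks_{column} "
            f"ON tasks ({column}) WHERE {column} IS NOT NULL"
        )


statements: List[str] = []
upgrade(statements.append)
```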

Rollout

  • Flag-off (default): no behavior change. Audit columns populate but no FGAC tuples are written. Safe to merge and deploy.
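  • Flag-on: register_resource and deregister_resource fire on create/delete. If they still fail after retries, the Postgres row remains the durable record; the resulting drift between Postgres and the auth graph (a task row with no tuples after a failed register, orphan tuples after a failed deregister) can be repaired out of band per the AGX1-291 operator runbook, which uses the creator-audit columns to identify affected rows.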
  • Operator rollout assumes a redeploy cycles pods; the flag is read once at DI-resolve time, so mid-process flips are intentionally invisible.
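
The read-once behavior in the last bullet can be sketched as follows. This is a simplified stand-in for the FastAPI DI wiring: TaskService, build_task_service, and the set of truthy strings are hypothetical, and a plain dict replaces the process environment so the example has no side effects.

```python
import os

FLAG = "FGAC_TASKS_DUAL_WRITE"


def _is_truthy(raw):
    # Accepted spellings are an assumption, not taken from the PR.
    return (raw or "").strip().lower() in {"1", "true", "yes", "on"}


class TaskService:
    def __init__(self, dual_write_enabled: bool):
        # The flag value is captured once, at construction time.
        self.dual_write_enabled = dual_write_enabled


def build_task_service(environ=os.environ):
    """Stand-in for DI resolution: read the env var and inject the result."""
    return TaskService(dual_write_enabled=_is_truthy(environ.get(FLAG)))


env = {FLAG: "true"}
service = build_task_service(env)

env[FLAG] = "false"                              # mid-process flip
still_enabled = service.dual_write_enabled       # existing instance is unaffected
fresh = build_task_service(env).dual_write_enabled  # a new resolve sees the flip
```

This is why the rollout relies on a redeploy: only newly constructed services observe a changed value.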

Audit-trail caveat

Creator attribution is best-effort: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs AgentTaskService without request.state.principal_context) leave both columns NULL. The CHECK constraint allows both-NULL, and test_no_resolvable_creator_leaves_both_columns_null exercises this path.
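
A minimal sketch of how the both-NULL case can arise, assuming the principal exposes user_id and service_account_id under those names (the field names, the PrincipalModel stand-in for a pydantic model, and both helper functions are illustrative, not the PR's code):

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class PrincipalModel:
    """Stand-in for a pydantic principal object."""
    user_id: Optional[str] = None
    service_account_id: Optional[str] = None


def principal_field(principal: Any, name: str) -> Optional[str]:
    """Read a field from a principal that may be a dict, an object, or absent."""
    if principal is None:
        return None
    if isinstance(principal, dict):
        return principal.get(name)
    return getattr(principal, name, None)


def resolve_creator(principal: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (creator_user_id, creator_service_account_id) for a new task row."""
    return (
        principal_field(principal, "user_id"),
        principal_field(principal, "service_account_id"),
    )


from_dict = resolve_creator({"user_id": "u-1"})
from_model = resolve_creator(PrincipalModel(service_account_id="sa-1"))
no_context = resolve_creator(None)  # e.g. a background worker with no request
```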

What changed

  • database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py (new): NOT VALID-pattern migration. down_revision = "6c942325c828".
  • src/adapters/orm.py: declarative CheckConstraint mirroring the DB constraint.
  • src/domain/entities/tasks.py: new optional fields on TaskEntity.
  • src/domain/services/task_service.py:
    • _principal_field helper (handles dict-vs-pydantic principal shape from the authn proxy).
    • create_task reads creator_user_id / creator_service_account_id from principal context.
    • AgentTaskService.__init__ takes dual_write_enabled: DEnvironmentVariable(EnvVarKeys.FGAC_TASKS_DUAL_WRITE).
    • _dual_write_with_retry(op_name, do_call, task_id) keyed by op name; reused from both call sites.
    • Dual-write call sites use parent= to match agentex-auth's wire schema.
  • src/adapters/authorization/adapter_agentex_authz_proxy.py: forwards to agentex-auth's /v1/authz/register (JSON body field parent) and /deregister.
  • src/adapters/authorization/port.py: register_resource(... parent: AgentexResource | None = None).
  • src/config/environment_variables.py: new FGAC_TASKS_DUAL_WRITE key.
  • Tests:
    • test_task_audit_columns.py — testcontainers Postgres integration tests for the audit columns (creator population, mutual-exclusion CHECK, both-NULL allowed).
    • test_task_fgac_dual_write.py — register-on-create, deregister-on-delete, flag-off skip, transient retry-and-succeed (both register and deregister), retry exhaustion propagating with the Postgres row preserved, and the name-route ItemDoesNotExist swallow. Assertions on the parent kwarg align with agentex-auth's wire schema.
    • Existing unit/integration tests updated for the new dual_write_enabled constructor parameter.

Test plan

  • migration_lint.py — clean.
  • Ruff + ruff-format + alembic migration-safety lint clean (pre-commit hooks).
  • test_task_audit_columns.py — passes locally via testcontainers.
  • test_task_fgac_dual_write.py — collects cleanly; runs in CI integration suite.
  • Manual: deploy to staging with flag off, confirm \d tasks shows new columns + constraint + indexes; flip flag on for one account, confirm task_fgac_dual_write.success fires.

Greptile Summary

This PR adds task creator audit columns (creator_user_id, creator_service_account_id) to the tasks table and implements an operator-gated FGAC dual-write path that registers/deregisters tasks with the agentex-auth authorization graph on create/delete.

  • Audit columns: NULLable columns populated from the request principal on create_task; a CHECK constraint enforces at-most-one creator type; partial CONCURRENTLY indexes scope to non-NULL rows only.
  • Dual-write: Gated behind FGAC_TASKS_DUAL_WRITE (off by default); _dual_write_with_retry provides exponential-backoff retry for transient auth errors, emitting Datadog counters for rollout observability; the Postgres row is the durable record regardless of auth-graph outcome.
  • _UNSET sentinel refactor: Replaces the ... sentinel in AuthorizationService methods with a named _UNSET object; also adds register_resource/deregister_resource delegating to the new gateway methods.

Confidence Score: 5/5

Safe to merge: flag defaults to off, no behavior change on deploy; audit columns are NULLable and additive; migration uses CONCURRENTLY indexes and the NOT VALID/VALIDATE split inside an autocommit_block.

The dual-write path is completely gated and does not affect any existing code path when the flag is off. The Postgres row is the durable record regardless of auth-graph outcome, and the retry/metric pattern mirrors the existing codebase. Tests are comprehensive, covering both call sites, flag-off, retry-and-succeed, and exhaustion scenarios.

No files require special attention beyond what has already been discussed in earlier review threads.

Important Files Changed

| Filename | Overview |
| --- | --- |
| agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py | New migration adding creator_user_id/creator_service_account_id columns, partial CONCURRENTLY indexes, and a NOT VALID + VALIDATE CHECK constraint pattern. CONCURRENTLY indexes are correctly placed in an autocommit_block; the NOT VALID/VALIDATE pair is also inside an autocommit_block (each op.execute in autocommit mode is its own transaction). |
| agentex/src/domain/services/task_service.py | Core dual-write logic: adds creator audit-column population on create_task, _dual_write_with_retry helper with exponential backoff, register on create and deregister on delete gated by flag. Logic is sound; minor style nit on inline jitter constant. |
| agentex/src/adapters/authorization/adapter_agentex_authz_proxy.py | Adds register_resource and deregister_resource methods forwarding to /v1/authz/register and /v1/authz/deregister. Pattern (principal passed raw, resource.model_dump()) is consistent with all existing methods in this file. |
| agentex/src/domain/services/authorization_service.py | Replaces the ... sentinel with a named _UNSET object (cleaner) and adds register_resource/deregister_resource delegating to the gateway with bypass/principal-context handling consistent with existing grant/revoke. |
| agentex/tests/integration/use_cases/test_task_fgac_dual_write.py | Integration tests covering register-on-create, deregister-on-delete, flag-off skip, transient retry with eventual success (both paths), retry exhaustion with row persistence, and missing-name error contract preservation. Comprehensive coverage. |

Sequence Diagram

sequenceDiagram
    participant C as API Caller
    participant TS as AgentTaskService
    participant TR as TaskRepository (Postgres)
    participant AS as AuthorizationService
    participant GW as AgentexAuthProxy

    C->>TS: create_task(agent, name, ...)
    TS->>TS: read principal_context → creator_user_id / creator_service_account_id
    TS->>TR: repository.create(task + creator columns)
    TR-->>TS: task_entity (committed)
    alt "FGAC_TASKS_DUAL_WRITE=on"
        TS->>AS: "register_resource(task, parent=agent)"
        loop retry up to 3x on transient error
            AS->>GW: POST /v1/authz/register
            GW-->>AS: 200 OK / transient error
        end
        AS-->>TS: success / exhausted → raise
    end
    TS-->>C: task_entity

    C->>TS: delete_task(id or name)
    alt name-only + flag on
        TS->>TR: get(name) → task.id
    end
    TS->>TR: repository.delete(id, name)
    TR-->>TS: deleted
    alt "FGAC_TASKS_DUAL_WRITE=on and id resolved"
        TS->>AS: deregister_resource(task)
        loop retry up to 3x on transient error
            AS->>GW: POST /v1/authz/deregister
            GW-->>AS: 200 OK / transient error
        end
    end
    TS-->>C: done
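
The delete half of the diagram above can be sketched in plain Python. This is an illustration under assumptions, not the service code: InMemoryTaskRepo, its method names, and the deregister callback are hypothetical, and ItemDoesNotExist is a local stand-in for the exception of the same name mentioned in the PR.

```python
class ItemDoesNotExist(Exception):
    """Local stand-in for the repository's not-found error."""


class InMemoryTaskRepo:
    def __init__(self, tasks):
        self._name_by_id = dict(tasks)  # task id -> task name

    def get_id_by_name(self, name):
        for task_id, task_name in self._name_by_id.items():
            if task_name == name:
                return task_id
        raise ItemDoesNotExist(name)

    def delete(self, task_id=None, name=None):
        if task_id is None:
            task_id = self.get_id_by_name(name)
        if task_id not in self._name_by_id:
            raise ItemDoesNotExist(task_id)
        del self._name_by_id[task_id]


def delete_task(repo, deregister, dual_write_enabled, task_id=None, name=None):
    resolved_id = task_id
    if dual_write_enabled and resolved_id is None and name is not None:
        # Resolve the id before the row disappears; a lookup after the
        # delete could no longer find it.
        try:
            resolved_id = repo.get_id_by_name(name)
        except ItemDoesNotExist:
            resolved_id = None  # swallow: let delete() raise its own error
    repo.delete(task_id=task_id, name=name)
    if dual_write_enabled and resolved_id is not None:
        deregister(resolved_id)


deregistered = []
repo = InMemoryTaskRepo({"t-1": "alpha", "t-2": "beta"})

delete_task(repo, deregistered.append, True, name="alpha")   # flag on, by name
delete_task(repo, deregistered.append, False, name="beta")   # flag off

try:
    delete_task(repo, deregistered.append, True, name="ghost")
    missing_error = None
except ItemDoesNotExist as exc:
    missing_error = exc
```

With the flag on, a missing name still fails with the repository's own error and nothing is deregistered, so flipping the flag does not change the error contract.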

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
agentex/src/domain/services/task_service.py:592
**Inline jitter ceiling is an unnamed magic number**

The `0.1` jitter ceiling in the backoff formula is an inline literal while the neighbouring `_REGISTER_MAX_RETRIES` and `_REGISTER_BASE_BACKOFF_SECONDS` are named module-level constants. A reviewer reading `random.uniform(0, 0.1)` has to guess what `0.1` represents; a named constant like `_REGISTER_JITTER_SECONDS` would make the intent explicit and keep the value change in one place if the retry budget is ever tuned.

Reviews (4): Last reviewed commit: "feat(AGX1-274): task FGAC dual-write cal..."

@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 13fe4b2 to 7486e5a Compare May 26, 2026 20:22
@asherfink asherfink changed the title feat(AGX1-274): dual-write tasks to spark-authz behind FGAC_TASKS_DUAL_WRITE flag feat(AGX1-274): record task creator identity and FGAC migration safety May 26, 2026
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 7486e5a to b9cb26b Compare May 26, 2026 20:56
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from ad1e980 to 3a06be8 Compare May 27, 2026 21:15
@asherfink asherfink marked this pull request as ready for review May 27, 2026 21:59
@asherfink asherfink requested a review from a team as a code owner May 27, 2026 21:59
Comment thread agentex/src/domain/services/task_service.py
dm36 added a commit that referenced this pull request May 28, 2026
…AGENT_API_KEYS_DUAL_WRITE flag

Mirrors the AGX1-274 task dual-write pattern (PR #246) for agent_api_keys.

- Adds creator_user_id / creator_service_account_id / spark_authz_zedtoken
  columns to agent_api_keys, with CHECK constraint and concurrent indexes.
- On create, when FGAC_AGENT_API_KEYS_DUAL_WRITE is enabled for the caller's
  account, calls authorization_service.grant(AgentexResource.api_key(id))
  BEFORE the Postgres write. Grant failure aborts the create.
- On delete, best-effort revoke after the Postgres delete. Failures are
  logged but do not block the delete.
- Adds AgentexResourceType.api_key and AgentexResource.api_key(...) factory.
- Creates src/utils/feature_flags.py with both FGAC_TASKS_DUAL_WRITE and
  FGAC_AGENT_API_KEYS_DUAL_WRITE (file does not exist on main yet; if PR #246
  lands first this becomes a rebase concern).

Structural divergence from tasks: agent_api_keys have no service layer, so
the dual-write logic lives in AgentAPIKeysUseCase rather than a separate
service. This keeps the call site simple and avoids inventing a new layer.

Route layer (read-side auth checks) is out of scope; that's PR B (AGX1-273).
agentex-auth spark_mapping.py update is a sibling-repo concern.
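
The ordering in this commit differs from the task path: grant happens before the Postgres write, and revoke after the delete is best-effort. A small sketch of that ordering, assuming a dict as the store and hypothetical grant/revoke callables (none of these names are from the referenced commit):

```python
import logging

logger = logging.getLogger(__name__)


def create_api_key(store, key_id, grant, dual_write_enabled):
    if dual_write_enabled:
        # Grant first: if this raises, the row below is never written.
        grant(key_id)
    store[key_id] = {"id": key_id}


def delete_api_key(store, key_id, revoke, dual_write_enabled):
    del store[key_id]
    if dual_write_enabled:
        try:
            revoke(key_id)
        except Exception:
            # Best-effort: log and continue, the delete already happened.
            logger.exception("revoke failed for api key %s", key_id)


def failing(_key_id):
    raise RuntimeError("auth backend unavailable")


store = {}
try:
    create_api_key(store, "k-1", failing, dual_write_enabled=True)
    create_aborted = False
except RuntimeError:
    create_aborted = True

create_api_key(store, "k-2", lambda _key_id: None, dual_write_enabled=True)
delete_api_key(store, "k-2", failing, dual_write_enabled=True)
```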

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@asherfink asherfink marked this pull request as draft May 28, 2026 17:59
asherfink added a commit that referenced this pull request May 28, 2026
…n creation

Adds two nullable creator-audit columns to the `tasks` table —
`creator_user_id` and `creator_service_account_id` — populated from
the request principal in `AgentTaskService.create_task`. A CHECK
constraint `ck_tasks_at_most_one_creator` enforces that at most one
of the two is set; partial indexes back future "tasks created by X"
lookups.

Online migration: the CHECK is added `NOT VALID` then
`VALIDATE`d separately so the ACCESS EXCLUSIVE lock is held only
briefly and never across the full-table validation scan. `tasks` is
a high-write table; a vanilla CHECK addition would hold that lock
for the whole scan, blocking reads and writes until it finished.
Indexes use `CREATE INDEX CONCURRENTLY` inside `autocommit_block`.

Best-effort attribution: tasks created outside an HTTP request
context (Temporal activities, background workers, any path that
constructs `AgentTaskService` without `request.state.principal_context`)
leave both columns NULL. The CHECK constraint allows both-NULL,
and an integration test exercises the no-resolvable-creator path.

These columns are how the AGX1-291 operator runbook identifies
orphan rows for backfill when the dual-write call sites added in
the next commit fail under load.

Part of the AGX1-264 stack: scaleapi/scaleapi NEW2
(per-account FF endpoint) → scaleapi/agentex#353 (agentex-auth
routing + cancel) → this PR → #249 (per-RPC
route migration). Two commits land together in #246; this one is
the schema/audit change and is independent of the dual-write call
sites.
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 3a06be8 to f7910e3 Compare May 28, 2026 20:08
asherfink added a commit that referenced this pull request May 28, 2026
Rewires the operation literal sent to agentex-auth on task RPC
routes so each RPC checks the permission that actually matches its
side effect, instead of using `execute` everywhere:

- `MESSAGE_SEND` / `EVENT_SEND` → `update`
- `TASK_CANCEL` → `cancel`
- `TASK_CREATE` stays `create`
- Unknown `AgentRPCMethod` values now raise `NotImplementedError`
  rather than falling through authz-free (defense-in-depth: a new
  RPC must be explicitly wired before it can dispatch).

The same `execute → update` swap is applied across `messages.py`,
`checkpoints.py`, and `states.py` so the editor role can perform
routine mutations without needing owner. The task SpiceDB schema
defines `permission update = (editor + owner) & internal_tenant_gate`,
so leaving these on `execute` (owner-only) would lock editors out
of normal flows.

Adds `check_task_or_collapse_to_404` in
`src/utils/task_authorization.py` and routes every task-resource
denial path through it: path id, query id, body id, and the name
surface in `authorization_shortcuts.py`. `tasks.name` is globally
unique, so a 403/404 split on the name route would let any
authenticated caller probe the whole system for task existence —
collapsing both denial cases into 404 closes that leak at the cost
of an in-tenant UX regression on permission-gap updates (tracked
under AGX1-290).

The `MESSAGE_SEND` task-name branch is restructured to
`try/else`: a denied update on an existing task must surface as 404
and NOT fall through to the create check, which would promote
"denied update" into create access.

Cross-repo wire dependency: the `update` and `cancel` literals
must resolve against the existing OWNER grant in SGP's task
permission schema before this deploys, otherwise every in-flight
agent's RPCs break at deploy time. Held behind that verification.

Part of the AGX1-264 stack: scaleapi/scaleapi NEW2
(per-account FF endpoint) → scaleapi/agentex#353 (agentex-auth
routing + cancel) → #246 (task FGAC
dual-write + audit columns) → this PR.
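
The operation mapping described in this commit message could be expressed as a lookup table that fails closed. The sketch below is illustrative: the enum string values are assumed, TASK_ARCHIVE is a hypothetical member added only to show the unwired case, and operation_for is not a function named in the source.

```python
from enum import Enum


class AgentRPCMethod(str, Enum):
    TASK_CREATE = "task/create"      # string values are assumptions
    MESSAGE_SEND = "message/send"
    EVENT_SEND = "event/send"
    TASK_CANCEL = "task/cancel"
    TASK_ARCHIVE = "task/archive"    # hypothetical, deliberately left unwired


_OPERATION_BY_METHOD = {
    AgentRPCMethod.TASK_CREATE: "create",
    AgentRPCMethod.MESSAGE_SEND: "update",
    AgentRPCMethod.EVENT_SEND: "update",
    AgentRPCMethod.TASK_CANCEL: "cancel",
}


def operation_for(method: AgentRPCMethod) -> str:
    """Return the authz operation for an RPC, refusing unknown methods."""
    try:
        return _OPERATION_BY_METHOD[method]
    except KeyError:
        # Fail closed: a new RPC must be wired explicitly before it dispatches.
        raise NotImplementedError(f"no authz operation wired for {method!r}") from None


wired = {m.name: operation_for(m) for m in _OPERATION_BY_METHOD}
```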
@asherfink asherfink marked this pull request as ready for review May 28, 2026 20:32
…n creation

Adds two nullable creator-audit columns to the `tasks` table —
`creator_user_id` and `creator_service_account_id` — populated from
the request principal in `AgentTaskService.create_task`. A CHECK
constraint `ck_tasks_at_most_one_creator` enforces that at most one
of the two is set; partial indexes back future "tasks created by X"
lookups.

Online migration: the CHECK is added `NOT VALID` then
`VALIDATE`d separately so the ACCESS EXCLUSIVE lock is held only
briefly and never across the full-table validation scan. `tasks` is
a high-write table; a vanilla CHECK addition would hold that lock
for the whole scan, blocking reads and writes until it finished.
Indexes use `CREATE INDEX CONCURRENTLY` inside `autocommit_block`.

Best-effort attribution: tasks created outside an HTTP request
context (Temporal activities, background workers, any path that
constructs `AgentTaskService` without `request.state.principal_context`)
leave both columns NULL. The CHECK constraint allows both-NULL,
and an integration test exercises the no-resolvable-creator path.

These columns are how the AGX1-291 operator runbook identifies
orphan rows for backfill when the dual-write call sites added in
the next commit fail under load.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
this PR -> #249 (per-RPC route migration).
Two commits land together in #246; this one is the schema/audit
change and is independent of the dual-write call sites in the
next commit.
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from f7910e3 to 31d3d40 Compare May 28, 2026 22:26
asherfink added a commit that referenced this pull request May 28, 2026
Rewires the operation literal sent to agentex-auth on task RPC
routes so each RPC checks the permission that actually matches its
side effect, instead of using `execute` everywhere:

- `MESSAGE_SEND` / `EVENT_SEND` -> `update`
- `TASK_CANCEL` -> `cancel`
- `TASK_CREATE` stays `create`
- Unknown `AgentRPCMethod` values now raise `NotImplementedError`
  rather than falling through authz-free (defense-in-depth: a new
  RPC must be explicitly wired before it can dispatch).

The same `execute -> update` swap is applied across `messages.py`,
`checkpoints.py`, and `states.py` so the editor role can perform
routine mutations without needing owner. The task SpiceDB schema
defines `permission update = (editor + owner) & internal_tenant_gate`,
so leaving these on `execute` (owner-only) would lock editors out
of normal flows.

Adds `check_task_or_collapse_to_404` in
`src/utils/task_authorization.py` and routes every task-resource
denial path through it: path id, query id, body id, and the name
surface in `authorization_shortcuts.py`. `tasks.name` is globally
unique, so a 403/404 split on the name route would let any
authenticated caller probe the whole system for task existence —
collapsing both denial cases into 404 closes that leak at the cost
of an in-tenant UX regression on permission-gap updates (tracked
under AGX1-290).

The `MESSAGE_SEND` task-name branch is restructured to
`try/else`: a denied update on an existing task must surface as 404
and NOT fall through to the create check, which would promote
"denied update" into create access.

Cross-repo wire dependency: the `update` and `cancel` literals
must resolve against the existing OWNER grant in SGP's task
permission schema before this deploys. `update` is already in
SGP's `AgentexOperation` enum; `cancel` is added by scaleapi/scaleapi
sgp-agentex-cancel-enum. Held behind that PR shipping everywhere.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
#246 (task FGAC dual-write + audit columns)
-> this PR.
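
The 404 collapse described in this commit message can be illustrated with a small helper. The real check_task_or_collapse_to_404 signature is not shown in the source, so the parameters, the NotFound class, and the message text below are assumptions chosen to keep the sketch self-contained.

```python
class NotFound(Exception):
    status_code = 404


def check_task_or_collapse_to_404(task, is_allowed):
    """Return the task if it exists and the caller may act on it.

    A missing task and a denied task raise the same error, so a caller
    cannot use the response to learn whether a task name exists.
    """
    if task is None or not is_allowed(task):
        raise NotFound("Task not found")
    return task


def outcome(task, is_allowed):
    try:
        check_task_or_collapse_to_404(task, is_allowed)
        return (200, "ok")
    except NotFound as exc:
        return (exc.status_code, str(exc))


missing = outcome(None, lambda task: True)
denied = outcome({"name": "alpha"}, lambda task: False)
allowed = outcome({"name": "alpha"}, lambda task: True)
```

The missing and denied outcomes are byte-for-byte identical, which is the property that closes the existence probe.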

dm36 commented May 28, 2026

Some tests are failing:

=========================== short test summary info ============================
FAILED tests/integration/use_cases/test_task_fgac_dual_write.py::TestTaskDualWrite::test_delete_task_deregisters - src.domain.exceptions.ServiceError: Invalid input resulted in constraint violation: (sqlalchemy.dialects.postgresql.asyncpg.IntegrityError) <class 'asyncpg.exceptions.ForeignKeyViolationError'>: update or delete on table "tasks" violates foreign key constraint "task_agents_task_id_fkey" on table "task_agents"
DETAIL:  Key (id)=(b46e09f1-0b9a-4618-a1ef-fdf12d865fd8) is still referenced from table "task_agents".
[SQL: DELETE FROM tasks WHERE tasks.id = $1::VARCHAR]
[parameters: ('b46e09f1-0b9a-4618-a1ef-fdf12d865fd8',)]
(Background on this error at: https://sqlalche.me/e/20/gkpj)
FAILED tests/integration/use_cases/test_task_fgac_dual_write.py::TestTaskDualWrite::test_flag_off_skips_register_and_deregister - src.domain.exceptions.ServiceError: Invalid input resulted in constraint violation: (sqlalchemy.dialects.postgresql.asyncpg.IntegrityError) <class 'asyncpg.exceptions.ForeignKeyViolationError'>: update or delete on table "tasks" violates foreign key constraint "task_agents_task_id_fkey" on table "task_agents"
DETAIL:  Key (id)=(453d1be2-9d4f-44fc-aff5-7d8fea1f6565) is still referenced from table "task_agents".
[SQL: DELETE FROM tasks WHERE tasks.id = $1::VARCHAR]
[parameters: ('453d1be2-9d4f-44fc-aff5-7d8fea1f6565',)]
(Background on this error at: https://sqlalche.me/e/20/gkpj)
FAILED tests/integration/use_cases/test_task_fgac_dual_write.py::TestTaskDualWrite::test_transient_unavailable_on_deregister_retries_then_succeeds - src.domain.exceptions.ServiceError: Invalid input resulted in constraint violation: (sqlalchemy.dialects.postgresql.asyncpg.IntegrityError) <class 'asyncpg.exceptions.ForeignKeyViolationError'>: update or delete on table "tasks" violates foreign key constraint "task_agents_task_id_fkey" on table "task_agents"
DETAIL:  Key (id)=(14932765-0fcb-4065-b219-999f61f3847c) is still referenced from table "task_agents".
[SQL: DELETE FROM tasks WHERE tasks.id = $1::VARCHAR]
[parameters: ('14932765-0fcb-4065-b219-999f61f3847c',)]
(Background on this error at: https://sqlalche.me/e/20/gkpj)
===== 3 failed, 81 passed, 449 deselected, 29 warnings in 78.38s (0:01:18) =====

…TE flag

Wires `register_resource` / `deregister_resource` into
`AgentTaskService.create_task` / `delete_task`, gated by a new
`FGAC_TASKS_DUAL_WRITE` env-var (off by default; resolved at
DI-resolve time so mid-process flips are intentionally invisible —
rollout assumes a redeploy cycles pods).

- `create_task`: after the Postgres row persists, register the
  task in the authorization graph with `parent=agent` so the
  tenant + owner + parent_agent tuples are written atomically.
- `delete_task`: pre-resolves the task id by name before the
  Postgres delete (lookup-after-delete would race), then
  deregisters once the row is gone. The name-lookup
  `ItemDoesNotExist` is swallowed so the subsequent `delete()`
  surfaces its own native error — flipping the flag must not
  change the error contract for missing tasks.
- Both call sites share `_dual_write_with_retry(op_name, do_call,
  task_id)`, which retries transient
  `AuthenticationServiceUnavailableError` /
  `AuthenticationGatewayError` with exponential backoff + jitter
  (3 retries -> 4 attempts max). Mirrors
  `agents_acp_use_case.grant_with_retry`, but with no `fail_task`
  fallback: the Postgres row is the durable record and orphan auth
  tuples are preferable to losing the task. The AGX1-291 operator
  runbook covers backfill using the creator-audit columns added in
  the parent commit.
- Emits `task_fgac_dual_write.{attempt,success,retry,failure}`
  statsd counters (tagged `op:register|deregister` and
  `exception_class` on failure) — the rollout signal for the
  FGAC_TASKS_DUAL_WRITE flip dashboard.

The `Port` interface gains `register_resource` /
`deregister_resource`, and the agentex-auth proxy adapter calls
`POST /v1/authz/register` and `POST /v1/authz/deregister`. The
endpoints themselves already live on agentex-auth `main` via #354;
per-account routing across them is set by scaleapi/agentex#353.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
this PR -> #249 (per-RPC route migration).
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 31d3d40 to a9bc09a Compare May 29, 2026 21:17
@asherfink
Author

updated, tests all pass now :)

3 participants